Apache Beam - Delta between windows - google-cloud-dataflow

Apologies, in trying to be concise and clear my previous description of my question turned into a special case of the general case I'm trying to solve.
New Description
I'm trying to compare the last emitted value of an aggregation function (let's say Sum()) with each element that I aggregate over in the current window.
Worth noting: the ideal solution (I think) would include the following.
The T2(from t-1) element used at time = t is the one that was created during the previous window.
I've been playing with several ideas/experiments, but I'm struggling to find a way to accomplish this that is elegant and "empathetic" to Beam's compute model (which I'm still trying to fully grok after many an article, blog, doc, and book :)
Side inputs seem unwieldy because it looks like I have to shift the emitted 5M#T-1 aggregation's timestamp into the 5M#T window in order to align it with the current 5M window.
In attempting this with side inputs (as I understand them), I ended up with some nasty code that was quite "circularly referential", but not in an elegant recursive way :)
Any help in the right direction would be much appreciated.
Edit:
Modified diagram and improved description to more clearly show:
the intent of using emitted T2(from t-1) to calculate T2 at t
that the desired T2(from t-1) used to calculate T2 is the one with the correct key

Instead of modifying the timestamps of materialized records so that they appear in the current window, you should supply a window mapping fn that maps the current window onto the previous one.
You'll want to create a custom WindowFn that implements the window mapping behavior you want, paying special attention to overriding the getDefaultWindowMappingFn function.
Your pipeline would be like:
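PCollection<T> mySource = /* data */;
// Side input: one combined value per window of the custom WindowFn.
PCollectionView<SumT> view = mySource
    .apply(Window.into(myCustomWindowFnWithNewWindowMappingFn))
    .apply(Combine.globally(myCombiner).asSingletonView());
// The DoFn sees the view's value for the window chosen by the window mapping fn.
mySource.apply(ParDo.of(/* DoFn that consumes side input */).withSideInputs(view));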
Pay special attention to the default value the combiner will produce since this will be the default value when the view has had no data emitted to it.
Also, the easiest way to write your own custom window function is to copy an existing one.
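To make the mapping idea concrete, here is a minimal plain-Java sketch of just the arithmetic, assuming fixed windows aligned to the epoch. The class and method names (PreviousWindowSketch, windowFor, sideInputWindowFor) are hypothetical and are not Beam classes; in a real pipeline this logic would live in the window mapping fn returned by your custom WindowFn.

```java
import java.time.Duration;
import java.time.Instant;

/** Plain-Java illustration only; none of these types come from Beam. */
final class PreviousWindowSketch {

    /** Half-open interval [start, end), the way a fixed window is bounded. */
    static final class Interval {
        final Instant start;
        final Instant end;

        Interval(Instant start, Instant end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Fixed window (aligned to the epoch) that contains the event timestamp. */
    static Interval windowFor(Instant eventTime, Duration size) {
        long sizeMs = size.toMillis();
        long startMs = Math.floorDiv(eventTime.toEpochMilli(), sizeMs) * sizeMs;
        return new Interval(Instant.ofEpochMilli(startMs),
                            Instant.ofEpochMilli(startMs + sizeMs));
    }

    /** The mapping step: a main-input window reads the side input of the window before it. */
    static Interval sideInputWindowFor(Interval mainWindow, Duration size) {
        return new Interval(mainWindow.start.minus(size), mainWindow.start);
    }
}
```

With 5-minute windows, an element stamped 00:07:30 lands in the window starting at 00:05:00, and the mapping points it at the aggregate of the window starting at 00:00:00, so no timestamps need to be rewritten.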

Related

Apache Beam: read from UnboundedSource with fixed windows

I have an UnboundedSource that generates N items (it's not in batch mode, it's a stream -- one that only generates a certain amount of items and then stops emitting new items but a stream nonetheless). Then I apply a certain PTransform to the collection I'm getting from that source. I also apply the Window.into(FixedWindows.of(...)) transform and then group the results by window using Combine. So it's kind of like this:
pipeline.apply(Read.from(new SomeUnboundedSource(...))) // extends UnboundedSource
    .apply(Window.into(FixedWindows.of(Duration.millis(5000))))
    .apply(new SomeTransform())
    .apply(Combine.globally(new SomeCombineFn()).withoutDefaults());
And I assumed that would mean new events are generated for 5 seconds, then SomeTransform is applied to the data in the 5 seconds window, then a new set of data is polled and therefore generated. Instead all N events are generated first, and only after that is SomeTransform applied to the data (but the windowing works as expected). Is it supposed to work like this? Does Beam and/or the runner (I'm using the Flink runner but the Direct runner seems to exhibit the same behavior) have some sort of queue where it stores items before passing it on to the next operator? Does that depend on what kind of UnboundedSource is used? In my case it's a generator of sorts. Is there a way to achieve the behavior that I expected or is it unreasonable? I am very new to working with streaming pipelines in general, let alone Beam. I assume, however, it would be somewhat illogical to try to read everything from the source first, seeing as it's, you know, unbounded.
An important thing to note is that windows in Beam operate on event time, not processing time. Adding 5-second windows to your data does not prescribe how the data is processed; it only determines how the results of aggregations are grouped. Further, windows only affect the data once an aggregation is reached, like your Combine.globally. Until that point in your pipeline the windowing you applied has no effect.
As to whether it is supposed to work that way, the Beam model doesn't specify any particular processing behavior, so other runners may process elements slightly differently. However, this is still a correct implementation. It isn't trying to read everything from the source; generally, streaming sources in Beam will attempt to read all elements currently available before moving on and coming back to the source later. If you adjusted your source to emit elements slowly over a long period of time, you would likely see more processing in between reads from the source.
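A small plain-Java sketch may help show why the arrival pattern does not change the windowed result. It is not Beam code: EventTimeWindowSketch and sumPerWindow are hypothetical names, and the input is simply {eventTimeMillis, value} pairs listed in the order they happen to arrive.

```java
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java illustration of event-time windowing; not a Beam API. */
final class EventTimeWindowSketch {

    /**
     * Sums values per fixed window. Each element is assigned to a window from
     * its own event timestamp, so the order of arrival plays no role.
     */
    static Map<Long, Long> sumPerWindow(long[][] arrivals, long windowMillis) {
        Map<Long, Long> sums = new TreeMap<>();
        for (long[] element : arrivals) {
            long eventTime = element[0];
            long value = element[1];
            long windowStart = Math.floorDiv(eventTime, windowMillis) * windowMillis;
            sums.merge(windowStart, value, Long::sum);
        }
        return sums;
    }
}
```

Whether the four elements {7000, 1}, {1000, 2}, {4000, 3}, {12000, 4} arrive in one burst or trickle in, 5000 ms windows give the same per-window sums.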

Question about SPSS Modeler (an obstacle to making the stream run automatically)

I have an SPSS Modeler stream that is used and updated every week to generate a certain dataset. The raw data for this stream is also refreshed weekly.
Part of this stream is a chunk of nodes that has to be modified and updated manually every week. The sequence of this part is: Type Node => Restructure Node => Aggregate Node
To simplify the explanation of those nodes' role, I drew an image of them, as below.
Because the raw data changes weekly, the range of Unit values above always varies: sometimes there are more than 6 (maybe 100), other times fewer than 6 (maybe 3). That is why, until now, somebody has had to modify and update that chunk of nodes every week. *The Unit value has an upper limit (300 for now).
However, we are now aiming to run this stream automatically, without any human intervention, so we need to customize that part to work fully automatically. Please help; I will appreciate your efforts, thanks!
To automate this, I suggest using global nodes combined with CLEM scripts in the execution tab (default script). I have a stream that calculates the first date and the last date, and those variables are used to rename files at the end of execution. I think you could use something similar, as explained here:
1) Create derive nodes to bring the unit values used in the weekly stream
2) Save this information in a table named 'count_variable'
3) Use a Global node named Global with a query similar to this:
@GLOBAL_MAX(variable created in (2)) (this only records the number of variables; step 2 created a table with only 1 value, so GLOBAL_MAX simply returns the number of variables).
4) The query inside the execution tab will be similar to this:
execute count_variable
var tabledata
var fn
set tabledata = count_variable.output
set count_variable = value tabledata at 1 1
execute Global
5) You can now use the variable information simply by referring to the already created "count_variable"
It's not easy to explain just by typing, but I hope to have been helpful.
Please +1 this answer if it was relevant.
I think there is a better, simpler and more effective (yet risky, due to the node's requirements on input data) solution to your problem. It is called the Transpose node and does exactly that: it pivots your table. It is only available from version 18.1 on, though. Here's an example:
https://developer.ibm.com/answers/questions/389161/how-does-new-feature-partial-transpose-work-in-sps/

How can I emit summary data for each window even if a given window was empty?

It is really important for my application to always emit a "window finished" message, even if the window was empty. I cannot figure out how to do this. My initial idea was to output an int for each record processed, use Sum.integersGlobally, and then emit a record based on that. This would give me a singleton per window, so I could simply emit one summary record per window, with 0 if the window was empty. Of course, this fails: you have to use withoutDefaults, which then emits nothing if the window was empty.
Cloud Dataflow is built around the notion of processing data that is likely to be highly sparse. By design, it does not conjure up data to fill in those gaps of sparseness, since this will be cost prohibitive for many cases. For a use case like yours where non-sparsity is practical (creating non-sparse results for a single global key), the workaround is to join your main PCollection with a heartbeat PCollection consisting of empty values. So for the example of Sum.integersGlobally, you would Flatten your main PCollection<Integer> with a secondary PCollection<Integer> that contains exactly one value of zero per window. This assumes you're using an enumerable type of window (e.g. FixedWindows or SlidingWindows; Sessions are by definition non-enumerable).
Currently, the only way to do this would be to write a data generator program that injects the necessary stream of zeroes into Pub/Sub with timestamps appropriate for the type of windows you will be using. If you write to the same Pub/Sub topic as your main input, you won't even need to add a Flatten to your code. The downside is that you have to run this as a separate job somewhere.
In the future (once our Custom Source API is available), we should be able to provide a PSource that accepts an enumerable WindowFn plus a default value and generates an appropriate unbounded PCollection.
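The heartbeat workaround can be sketched in plain Java to show why every window ends up with a result. This is an illustration only, not Dataflow code: HeartbeatSketch and sumWithHeartbeat are hypothetical names, and the sketch assumes fixed windows and a range start that lies on a window boundary.

```java
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java illustration of the heartbeat idea; not a Dataflow API. */
final class HeartbeatSketch {

    /**
     * Sums values per fixed window after merging in one zero per window for
     * every window in [rangeStart, rangeEnd). Assumes rangeStart is aligned
     * to a window boundary.
     */
    static Map<Long, Integer> sumWithHeartbeat(long[] timestamps, int[] values,
                                               long rangeStart, long rangeEnd,
                                               long windowMillis) {
        Map<Long, Integer> sums = new TreeMap<>();
        // Heartbeat collection: exactly one zero per enumerated window.
        for (long start = rangeStart; start < rangeEnd; start += windowMillis) {
            sums.merge(start, 0, Integer::sum);
        }
        // Main collection: real values, assigned to windows by timestamp.
        for (int i = 0; i < timestamps.length; i++) {
            long start = Math.floorDiv(timestamps[i], windowMillis) * windowMillis;
            sums.merge(start, values[i], Integer::sum);
        }
        return sums;
    }
}
```

Because zero is the identity for a sum, the heartbeat values do not change any non-empty window, and an empty window reports 0 instead of nothing.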

Filter DOORS on historical data

Is there a way to filter based on historical data?
For example: "Show me all objects that had "Attribute_X" == True on 01/01/2013"
As Steve stated, this would require an advanced DXL script.
I'm not sure about creating a filter for this, but I might be able to help with identifying the objects you are looking for. Having recently solved a similar task, I recommend starting with Tony Goodman's really excellent Smart History Viewer (this code could be used as a DXL tutorial!), which has almost all the code you need. You just need to find and understand it.
Let me elaborate. Besides other nifty stuff, the history viewer basically does:
For all (selected) baselines, explicitly including un-baselined current version: gather all module changes and put them into a two-dimensional Skip list each, for module/object/session changes. Focus on the object changes.
There is an unused function printObjectHistory in the code which helps in understanding the data structures. Have a look at the inner loop
for hist in skipHistory do
Inside this loop, consider only changes which happened before "01/01/2013" (check hist->HIST_DATE to obtain this information). The history viewer code has already classified the detected changes, so you want to watch out for changes which contain the string "Modify Attribute: Attribute_X". Assign the new value to a buffer. Outside this loop, check if the buffer contains "True". If so, this is one of the objects you wanted to find.
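The replay logic described above, sketched in Java rather than DXL purely to show the control flow. Everything here is hypothetical (HistoryReplaySketch, Change, valueBefore are not part of DOORS or the Smart History Viewer), and the sketch assumes the history entries are visited oldest first.

```java
import java.time.LocalDate;
import java.util.List;

/** Illustration of the history replay; not DXL and not DOORS data types. */
final class HistoryReplaySketch {

    /** One recorded attribute modification. */
    static final class Change {
        final LocalDate date;
        final String attribute;
        final String newValue;

        Change(LocalDate date, String attribute, String newValue) {
            this.date = date;
            this.attribute = attribute;
            this.newValue = newValue;
        }
    }

    /**
     * Returns the last value written to the attribute before the cutoff date,
     * or null if it was never modified before then.
     * Assumes the history list is in chronological order.
     */
    static String valueBefore(List<Change> history, String attribute, LocalDate cutoff) {
        String buffer = null;
        for (Change change : history) {
            if (change.date.isBefore(cutoff) && change.attribute.equals(attribute)) {
                buffer = change.newValue; // later matching changes overwrite earlier ones
            }
        }
        return buffer;
    }
}
```

An object matches the query when valueBefore(history, "Attribute_X", cutoff) returns "True".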

Store Redundant Info vs. Repeated Conversions

Is it preferable to store redundant information, (which can be otherwise generated from existing data,) or to instead convert the existing data each time you need access?
I've simplified my specific problem as best as I can below, hoping that the provided answers are useful as future-reference material.
Example:
Let's say we've developed a program that places data into Squares on a grid (like a super-descriptive game of Tic-Tac-Toe or something) and assigns various details, and a unique identification number to each:
Throughout our program, we often perform logic based on a square's X and/or Y coordinates (checking for 3 in a row) and other times we only need the ID (perhaps to access a string at "SquareName[ID]") - We aren't exactly certain which of these two is accessed more often, but it's a rather close competition.
Up until now we've simply stored the ID inside the square class, and converted it with some simple formulas whenever just the X or Y are needed. Say we want to get coordinates for one square in particular:
int CurrentX = ((this.Square.ID - 1) % 3) + 1; // X coordinate, 1 through 3
int CurrentY = ((this.Square.ID - 1) / 3) + 1; // Y coordinate, 1 through 3 (integer division)
Since the squares don't move around or change ID after setup, part of me believes it would be simpler just to store all 3 values inside the Square class, but my other part cringes at the redundancy since access to X and Y is already easy enough to calculate from the existing ID.
(Note, This program itself is not very memory or resource intensive, nor does the size of the grid get much larger, so it mostly comes down to which option is a better practice or rule of thumb.)
What would you do?
As a rule of thumb, for a system where the data is read/write, store your basic data without redundancy.
When performance or other considerations become a practical issue, then you should denormalize as necessary. (i.e. wait for it to be a problem, don't pre-optimize overly much).
Your goal should be the most maintainable code possible. That usually means writing the least code possible. Having extra code to maintain redundant copies of data points will make your code more brittle.
If those are values that can be determined at the moment of creation and do not change afterwards, I would go for variables populated in the constructor. It's not redundant info insofar as it isn't stored anywhere else, but that's not my main point. When reading code, I usually expect that anything computed at the time of the request might change per request. It is easy to find the point in the source where a field is populated and where it is changed, especially if it never changes. But you might end up slightly confused when looking at a calculation that always returns the same result because its variables can't change, wondering whether you're missing a case or whether this really is static.
Also, by using a descriptive variable name, you can get rid of the comments. Not that I generally aim to avoid commenting, but source code that doesn't even need comments is a pretty safe signal of easy-to-understand code, which might (or should) be your aim.
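A minimal sketch of the constructor-populated approach, written in Java and assuming IDs run 1 through 9 row by row on a 3x3 grid. The class name GridSquare and the constant GRID_WIDTH are made up for this illustration.

```java
/** Immutable square whose coordinates are derived once, at construction. */
final class GridSquare {
    static final int GRID_WIDTH = 3; // assumed 3x3 grid with IDs 1..9, row by row

    final int id;
    final int x; // column, 1 through 3
    final int y; // row, 1 through 3

    GridSquare(int id) {
        this.id = id;
        this.x = (id - 1) % GRID_WIDTH + 1;
        this.y = (id - 1) / GRID_WIDTH + 1; // integer division
    }
}
```

Since all three fields are final, a reader can see at a glance that x and y never change after setup, and the conversion formulas live in exactly one place.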

Resources