How to write a unittest for a sliding window? - google-cloud-dataflow

Does Dataflow provide a way for me to set the start point of the first window? Or is there a formula for computing the start point?
I'm trying to write a unittest for a composite transform that is applying a SlidingWindow, a GroupByKey, and then a DoFn.
My windows will be
[To + i * period, To + i * period + duration)
where To is the start of the first window, period is the period of the windows and duration is the duration of the window.
So without knowing To I can't precompute the expected values in the output and pass them to DataflowAssert to validate the result.

One work around would be to not use DataflowAssert. I could add two transforms to my test pipeline 1) one to attach the time window boundary to each data point and 2) one to write the data points to a temporary file.
After the pipeline runs, I can materialize the results by reading the temporary file. Since the data points are labeled with the end value of each window I can compute what the expected values should be.

Related

Apache Beam: read from UnboundedSource with fixed windows

I have an UnboundedSource that generates N items (it's not in batch mode, it's a stream -- one that only generates a certain amount of items and then stops emitting new items but a stream nonetheless). Then I apply a certain PTransform to the collection I'm getting from that source. I also apply the Window.into(FixedWindows.of(...)) transform and then group the results by window using Combine. So it's kind of like this:
pipeline.apply(Read.from(new SomeUnboundedSource(...)) // extends UnboundedSource
.apply(Window.into(FixedWindows.of(Duration.millis(5000))))
.apply(new SomeTransform())
.apply(Combine.globally(new SomeCombineFn()).withoutDefaults())
And I assumed that would mean new events are generated for 5 seconds, then SomeTransform is applied to the data in the 5 seconds window, then a new set of data is polled and therefore generated. Instead all N events are generated first, and only after that is SomeTransform applied to the data (but the windowing works as expected). Is it supposed to work like this? Does Beam and/or the runner (I'm using the Flink runner but the Direct runner seems to exhibit the same behavior) have some sort of queue where it stores items before passing it on to the next operator? Does that depend on what kind of UnboundedSource is used? In my case it's a generator of sorts. Is there a way to achieve the behavior that I expected or is it unreasonable? I am very new to working with streaming pipelines in general, let alone Beam. I assume, however, it would be somewhat illogical to try to read everything from the source first, seeing as it's, you know, unbounded.
An important thing to note is that windows in Beam operate on event time, not processing time. Adding 5 second windows to your data is not a way to prescribe how the data should be processed, only the end result of aggregations for that processing. Further, windows only affect the data once an aggregation is reached, like your Combine.globally. Until that point in your pipeline the windowing you applied has no effect.
As to whether it is supposed to work that way, the beam model doesn't specify any specific processing behavior so other runners may process elements slightly differently. However, this is still a correct implementation. It isn't trying to read everything from the source; generally streaming sources in Beam will attempt to read all elements available before moving on and coming back to the source later. If you were to adjust your stream to stream in elements slowly over a long period of time you will likely see more processing in between reading from the source.

Beam CoGroupByKey with fixed window and event time based trigger generates random elements

I have a pipeline in Beam that uses CoGroupByKey to combine 2 PCollections, first one reads from a Pub/Sub subscription and the second one uses the same PCollection, but enriches the data by looking up additional information from a table, using JdbcIO.readAll. So there is no way there would be data in the second PCollection without it being there in the first one.
There is a fixed window of 10seconds with an event based trigger like below;
Repeatedly.forever(
AfterWatermark.pastEndOfWindow().withEarlyFirings(
AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardSeconds(40))
).withLateFirings(AfterPane.elementCountAtLeast(1))
);
The issue I am seeing is that when I stop the pipeline using the Drain mode, it seems to be randomly generating elements for the second PCollection when there has not been any messages coming in to the input Pub/Sub topic. This also happens randomly when the pipeline is running as well, but not consistent, but when draining the pipeline I have been able to consistently reproduce this.
Please find the variation in input vs output below;
You are using a non-deterministic triggering, which means the output is sensitive to the exact ordering in which events come in. Another way to look at this is that CoGBK does not wait for both sides to come in; the trigger starts ticking as soon as either side comes in.
For example, lets call your PCollections A and A' respectively, and assume they each have two elements a1, a2, a1', and a2' (of common provenance).
Suppose a1 and a1' come into the CoGBK, 39 seconds passes, and then a2 comes in (on the same key), another 2 seconds pass, then a2' comes in. The CoGBK will output ([a1, a2], [a1']) when the 40-second mark hits, and then when the window closes ([], [a2']) will get emitted. (Even if everything is on the same key, this could happen occasionally if there is more than a 40-second walltime delay going through the longer path, and will almost certainly happen for any late data (each side will fire separately).
Draining makes things worse, e.g. I think all processing time triggers fire immediately.

Question about SPSS modeler (There is an obstacle for make the stream run automatically)

I have SPSSmodeler stream which is now used and updated every week constantly to generate a certain dataset. A raw data for this stream is also renewed on a weekly basis.
In part of this stream, there is a chunk of nodes that were necessary to modify and update manually every week, and the sequence of this part is below: Type Node => Restructure Node => Aggregate Node
To simplify the explanation of those nodes' role, I drew an image of them as bellow.
Because the original raw data is changed weekly basis, the range of Unit value above is always varied, sometimes more than 6 (maybe 100) others less than 6 (maybe 3). That is why somebody has to modify there and update those chunk of nodes on a weekly basis until now. *Unit value has a certain limitation (300 for now)
However, now we are aiming to run this stream automatically without touching any human operations on it that we need to customize there to work perfectly, automatically. Please help and will appreciate your efforts, thanks!
In order to automatize, I suggest to try to use global nodes combined with clem scripts inside the execution (default script). I have a stream that calculates the first date and the last date and those variables are used to rename files at the end of execution. I think you could use something similar as explained here:
1) Create derive nodes to bring the unit values used in the weekly stream
2) Save this information in a table named 'count_variable'
3) Use a Global node named Global with a query similar to this:
#GLOBAL_MAX(variable created in (2)) (only to record the number of variables. The step 2 created a table with only 1 values, so the GLOBAL_MAX will only bring the number of variables).
4) The query inside the execution tab will be similar to this:
execute count_variable
var tabledata
var fn
set tabledata = count_variable.output
set count_variable = value tabledata at 1 1
execute Global
5) You now can use the information of variables just using the already creatde "count_variable"
It's not easy to explain just by typing, but I hope to have been helpful.
Please mark as +1 in this answer if it was relevant one.
I think there is a better, simpler and more effective (yet risky, due to node's requirements to input data) solution to your problem. It is called Transpose node and does exactly that - pivot your table. But just from version 18.1 on. Here's an example:
https://developer.ibm.com/answers/questions/389161/how-does-new-feature-partial-transpose-work-in-sps/

Scale-building in SPSS

I'm using Cronbach's alpha to analyze data in order to build/refine a scale. This is a tedious process in SPSS, since it doesn't automatically optimize the scale, so I'm hoping there is a way to use syntax to speed it up.
So I start with a set of items, set up the OMS control panel to capture the item-total statistics table, and the run the alpha analysis. This pushes the item-total stats into a new dataset. Then I check the alpha value, and use it in syntax to screen out items that have a greater alpha-if-deleted value.
Then I re-run the analysis with only the items passed the screening. And I repeat, until all the items pass the screening. Here is the syntax:
* First syntax sets up OMS, and then runs the alpha analysis.
* In the reliability syntax, I have to manually add the variables and the Scale name.
* OMS.
DATASET DECLARE alpha_worksheet.
OMS
/SELECT TABLES
/IF COMMANDS=['Reliability'] SUBTYPES=['Item Total Statistics']
/DESTINATION FORMAT=SAV NUMBERED=TableNumber_
OUTFILE='alpha_worksheet' VIEWER=YES.
RELIABILITY
/VARIABLES=
points_18618
points_18618
points_3286
points_3290
points_3583
points_4018
points_7775
points_7789
points_7792
points_18631
points_18652
/SCALE('2017 Fall CRN 4157 Exam 01 v. 1.0') ALL
/MODEL=ALPHA
/SUMMARY=TOTAL.
* Second syntax identifies any variables in the OMS dataset that are LTE the alpha value.
* I have to manually enter the alpha value...
DATASET ACTIVATE alpha_worksheet.
IF (CronbachsAlphaifItemDeleted <= .694) Keep =1.
EXECUTE.
SORT CASES BY Keep(D).
Ideally, instead of having to repeat this process over and over, I'd like syntax that would automate this process.
Hope that makes sense, and if you have a solution thanks in advance (this has been bugging me for years!) Cheers

Real-time pipeline feedback loop

I have a dataset with potentially corrupted/malicious data. The data is timestamped. I'm rating the data with a heuristic function. After a period of time I know that all new data items coming with some IDs needs to be discarded and they represent a significant portion of data (up to 40%).
Right now I have two batch pipelines:
First one just runs the rating over the data.
The second one first filters out the corrupted data and runs the analysis.
I would like to switch from batch mode (say, running every day) into an online processing mode (hope to get a delay < 10 minutes).
The second pipeline uses a global window which makes processing easy. When the corrupted data key is detected, all other records are simply discarded (also using the discarded keys from previous days as a pre-filter is easy). Additionally it makes it easier to make decisions about the output data as during the processing all historic data for a given key is available.
The main question is: can I create a loop in a Dataflow DAG? Let's say I would like to accumulate quality-rates given to each session window I process and if the rate sum is over X, some a filter function in earlier stage of pipeline should filter out malicious keys.
I know about side input, I don't know if it can change during runtime.
I'm aware that DAG by definition cannot have cycle, but how achieve same result without it?
Idea that comes to my mind is to use side output to mark ID as malicious and make fake unbounded output/input. The output would dump the data to some storage and the input would load it every hour and stream so it can be joined.
Side inputs in the Beam programming model are windowed.
So you were on the right path: it seems reasonable to have a pipeline structured as two parts: 1) computing a detection model for the malicious data, and 2) taking the model as a side input and the data as a main input, and filtering the data according to the model. This second part of the pipeline will get the model for the matching window, which seems to be exactly what you want.
In fact, this is one of the main examples in the Millwheel paper (page 2), upon which Dataflow's streaming runner is based.

Resources