Join of multiple streams with the Python SDK - google-cloud-dataflow

I would like to join multiple streams on a common key and trigger a result either as soon as all of the streams have contributed at least one element or at the end of the window. CoGroupByKey seems to be the appropriate building block, but there does not seem to be a way to express the early trigger condition (count trigger applies per input collection)?

I believe CoGroupByKey is implemented as Flatten + GroupByKey under the hood. Once multiple streams are flattened into one, a data-driven trigger (or any other trigger) won't have enough control to achieve what you want.
Instead of using CoGroupByKey, you can use Flatten plus a stateful DoFn that fills a per-key object backed by State. In this setup, the stateful DoFn also gets to decide what to do when stream A has received two elements but stream B has not received any yet.
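A minimal sketch of that approach with the Python SDK, assuming the two streams have already been keyed and tagged with a source name before the Flatten; the names left/right and the early-emit logic are illustrative, and an event-time timer would still be needed for the end-of-window fallback.

import apache_beam as beam
from apache_beam.coders import PickleCoder
from apache_beam.transforms.userstate import BagStateSpec


class EarlyJoinFn(beam.DoFn):
    """Buffers elements per key and emits as soon as every stream has contributed."""
    LEFT = BagStateSpec('left', PickleCoder())
    RIGHT = BagStateSpec('right', PickleCoder())

    def process(self, element,
                left=beam.DoFn.StateParam(LEFT),
                right=beam.DoFn.StateParam(RIGHT)):
        key, (source, value) = element   # element: (key, (source_name, payload))
        (left if source == 'left' else right).add(value)
        lefts, rights = list(left.read()), list(right.read())
        if lefts and rights:
            # Both streams have contributed at least one element: emit early.
            yield key, {'left': lefts, 'right': rights}
            left.clear()
            right.clear()
        # An event-time timer (beam.DoFn.TimerParam) could emit whatever has
        # been buffered at the end of the window as the fallback result.


joined = flattened | beam.ParDo(EarlyJoinFn())   # flattened: Flatten of the keyed streams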

Another potential solution that comes to mind is a (stateless) DoFn that filters the CoGBK results to remove those that don't have at least one occurrence for each joined stream. For the end-of-window result (which does not have the same restriction), it would then be necessary to have a parallel CoGBK whose result would not go through the filter. I don't think there is a way to tag results with the trigger that emitted them?
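In code, the stateless filter over the CoGroupByKey result might look like the following sketch; the stream names stream_a and stream_b are illustrative, not from the question.

import apache_beam as beam

joined = (
    {'stream_a': stream_a, 'stream_b': stream_b}
    | beam.CoGroupByKey())

# Keep only keys for which every joined stream has contributed at least one element.
early_results = joined | beam.Filter(
    lambda kv: all(len(values) > 0 for values in kv[1].values()))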

Related

Event Store DB: temporal queries

Regarding the question asked here:
Suppose we have ProductCreated and ProductRenamed events, both of which contain the title of the product. We want to query EventStoreDB for all events of type ProductCreated and ProductRenamed with a given title. I want all these events in order to check whether any product in the system has been created with, or renamed to, the given title, so that I can throw a duplicate-title exception in the domain.
I am using MongoDB for creating UI reports from all the published events, and everything is fine there. But for checking some invariants, like checking for unique values, I have to query the event store for certain events along with their criteria and, by iterating over them, decide whether there is a product created with the same title which has not been renamed, or a product renamed to the same title.
For such queries, the only mechanism the event store provides is creating a one-time projection with the proper JavaScript code, which filters and emits the required events to a new stream. Then all I have to do is fetch the events from the newly generated stream that the projection fills.
Now the odd thing is, projections are great for subscriptions and generating new streams, but they seem awkward for real-time queries. Immediately after I create a projection with the HTTP API, I check the new resulting stream for the query result, but it seems the workers have not had a chance to produce the result yet and I get a 404 response. After waiting a few seconds, the new stream pops up and gets filled with the result.
There are too many things wrong with this approach:
First, it seems that if the event store is filled with millions of events across many streams, it won't be able to process and filter all of them into the resulting stream immediately. It does not even create the stream immediately, let alone populate it, so I have to wait for some time and then check for the result, hoping that the projection is done.
Second, I have to fetch multiple times and issue multiple GET HTTP requests, which seems slow. The new JVM client is not ready yet.
Third, I have to delete the resulting stream after I'm done with the result; failing to do so will leave the event store with millions of orphan query-result streams.
I wish I could pass the JavaScript to some API and get the result page by page, like querying MongoDB, without worrying about the projection, new streams and timing issues.
I have seen a query section in the Admin UI, but I don't know what it's for, and unfortunately the documentation doesn't help much.
Am I expecting the event store to do something that is impossible?
Do I have to create a read model inside the bounded context for doing such checks?
I am using my events to rehydrate the aggregates and would like to use the same events for such simple queries without adopting other techniques.
I believe it would not be a separate bounded context since the check you want to perform belongs to the same bounded context where your Product aggregate lives. So, the projection that is solely used to prevent duplicate product names would be a part of the same context.
You can indeed use a custom projection to check it but I believe the complexity of such a solution would be higher than having a simple read model in MongoDB.
It is also fine to use an existing projection, if you have one, to do the check. It might not be what you would otherwise prefer if the aim of the existing projection is to show things in the UI.
For the collection that you could use for duplicates check, you can have the document schema limited to the id only (string), which would be the product title. Since collections are automatically indexed by the id, you won't need any additional indexes to support the duplicate check query. When the product gets renamed, you'd need to delete the document for the old title and add a new one.
Again, you will get a small time window during which a duplicate can slip in. It's then up to the business to decide if the concern is real (it's not, most of the time) and what the consequences would be if it happens one day. You'd be able to find a duplicate quite easily when projecting events and decide what to do when it happens.
Practically, when you have such a projection, all it takes is to build a simple domain service bool ProductTitleAlreadyExists.
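As a rough illustration, such a read model and domain service could look like the following with pymongo; the database/collection names and the event-handler shapes are assumptions for the sketch, not part of the original answer.

from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError

# _id is the product title, so the default index on _id serves the duplicate check.
titles = MongoClient()["catalog"]["product_titles"]

def product_title_already_exists(title: str) -> bool:
    return titles.find_one({"_id": title}) is not None

def on_product_created(event):
    try:
        titles.insert_one({"_id": event["title"]})
    except DuplicateKeyError:
        # A duplicate slipped in during the race window; decide here what to do.
        pass

def on_product_renamed(event):
    # Drop the old title and register the new one.
    titles.delete_one({"_id": event["old_title"]})
    titles.insert_one({"_id": event["new_title"]})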

Dynamic query usage while streaming with Google Dataflow?

I have a Dataflow pipeline that is set up to receive information (JSON), transform it into a DTO and then insert it into my database. This works great for inserts, but where I am running into issues is with handling delete records. The information I am receiving includes a deleted tag in the JSON to specify when a record is actually being deleted. After some research/experimenting, I am at a loss as to whether this is possible.
My question: Is there a way to dynamically choose (or change) which SQL statement the pipeline uses while streaming?
To achieve this with Dataflow you need to think more in terms of water flowing through pipes than in terms of if-then-else coding.
You need to classify your records into INSERTs and DELETEs and route each set to a different sink that will do what you tell them to. You can use tags for that.
In this pipeline design example, instead of startsWithATag and startsWithBTag you can use tags for Insert and Delete.
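A minimal sketch of that routing in the Beam Python SDK; the tag names, the records collection and the deleted field are illustrative assumptions.

import apache_beam as beam
from apache_beam import pvalue

class RouteByOperationFn(beam.DoFn):
    def process(self, record):
        # Route each record to a tagged output based on the deleted flag.
        if record.get('deleted'):
            yield pvalue.TaggedOutput('deletes', record)
        else:
            yield pvalue.TaggedOutput('inserts', record)

routed = (records
          | beam.ParDo(RouteByOperationFn()).with_outputs('inserts', 'deletes'))
inserts, deletes = routed.inserts, routed.deletes
# inserts | <sink/transform that issues INSERT statements>
# deletes | <sink/transform that issues DELETE statements>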

How to dedupe across overlapping sliding windows in Apache Beam / Dataflow

I have the following requirement:
read events from a Pub/Sub topic
take a window of duration 30 minutes and period 1 minute
in that window, if 3 events for a given id all match some predicate, then I need to raise an event in a different Pub/Sub topic
The event should be raised as soon as the 3rd event comes in for the grouping id, as this is for detecting fraudulent behaviour. In one pane there may be many ids that have 3 events matching my predicate, so I may need to emit multiple events per pane.
I am able to write a function which consumes a PCollection, does the necessary grouping, logic and filtering, and emits events according to my business logic.
Questions:
The output PCollection contains duplicates due to the overlapping sliding windows. I understand this is the expected behaviour of sliding windows, but how can I avoid this whilst staying in the same Dataflow pipeline? I realise I could dedupe in an external system, but that just adds complexity to my system.
I also need to write some sort of trigger that fires each and every time my condition is reached in a window.
Is Dataflow suitable for this type of real-time detection scenario?
Many thanks
You can rewindow the output PCollection into the global window (using the regular Window.into()) and dedupe using a GroupByKey.
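A sketch of that dedupe step in the Python SDK. Since a streaming GroupByKey in the global window would also need a trigger, this variant uses per-key state to keep only the first occurrence of each alert id; the alerts collection and the key-extraction lambda are assumptions about the alert's shape.

import apache_beam as beam
from apache_beam.coders import BooleanCoder
from apache_beam.transforms.userstate import ReadModifyWriteStateSpec
from apache_beam.transforms.window import GlobalWindows

class EmitOnceFn(beam.DoFn):
    SEEN = ReadModifyWriteStateSpec('seen', BooleanCoder())

    def process(self, element, seen=beam.DoFn.StateParam(SEEN)):
        key, alert = element
        if not seen.read():          # first time we see this id
            seen.write(True)
            yield alert

deduped = (
    alerts
    | beam.WindowInto(GlobalWindows())              # rewindow out of the sliding windows
    | beam.Map(lambda alert: (alert['id'], alert))  # key by the dedupe id
    | beam.ParDo(EmitOnceFn()))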
It sounds like you're already returning the events of interest as a PCollection. In order to "do something for each event", all you need is a ParDo.of(whatever action you want) applied to this collection. Triggers do something else: they control what happens when a new value V arrives for a particular key K in a GroupByKey<K, V>: whether to drop the value, or buffer it, or to pass the buffered KV<K, Iterable<V>> for downstream processing.
Yes :)

Use an "IF statement" on the "ORDER BY" in a report in INFORMIX 4GL

I have a report which is called many times, but I want the order to be different each time.
How can I change the "order by" depending on a variable?
For example
report print_label()
  if is_reprint then
    order by rpt.item_code, rpt.description
  else
    order by rpt.description, rpt.item_code
  end if
I tried passing in a variable when calling the report:
let scratch = "rpt.item_code, rpt.description"
start print_label(scratch)
And in the report I did:
order by scratch
But it didn't work. Any other suggestions?
Thank you!
The technique I've used for that type of problem is along the lines of
REPORT report_name(x)
  DEFINE x RECORD
    param1, param2, ..., paramN ...,
    sort_method ...,
    data ...
  END RECORD

  ORDER [EXTERNAL] BY x.param1, x.param2, ..., x.paramN

  BEFORE GROUP OF x.param1
    CASE
      WHEN x.sort_method ...
        PRINT ...
      WHEN x.sort_method ...
        PRINT ...
    END CASE
  BEFORE GROUP OF x.param2
    # similar technique as above
  ...
  BEFORE GROUP OF x.paramN
    # similar technique as above

  ON EVERY ROW
    PRINT ...

  AFTER GROUP OF x.paramN
    # similar technique as above
  ...
  AFTER GROUP OF x.param2
    # similar technique as above
  AFTER GROUP OF x.param1
    # similar technique as above
... and then in the 4GL code that calls the REPORT, populate x.param1, x.param2, ..., x.paramN with the desired parameters used for sorting, e.g.
CASE x.sort_method
  WHEN "product,branch"
    LET x.param1 = x.data.product_code
    LET x.param2 = x.data.branch_code
  WHEN "branch,product"
    LET x.param1 = x.data.branch_code
    LET x.param2 = x.data.product_code
END CASE
OUTPUT TO REPORT report_name(x.*)
So as per my example, that's a technique I've seen and used for things like stock reports. The warehouse/branch/store manager wants to see things ordered by warehouse/branch/store and then by product/sku/item, whilst a product manager wants to see things ordered by product/sku/item and then by warehouse/branch/store. More analytical reports with more potential parameters can be handled using the same technique. I think the record I have seen is 6, and in that case it is much better to have 1 report covering all 6! = 720 potential combinations, rather than writing a separate report for each possible order combination.
So this is probably similar to Jonathan's option 1, although I don't have the same reservations about the complexity. I don't recall ever catching any of my junior developers getting it badly wrong at code review. In fact, if the report is generic enough, you'll find that you don't need to touch it too often.
Short answer
The ORDER BY clause in an I4GL REPORT function has crucial effects on how the code implementing the report is generated. It is simply not feasible to rewire the generated code like that at run-time.
Therefore, you can't achieve your desired result directly.
Notes
Note that you should probably be using ORDER EXTERNAL BY rather than ORDER BY — the difference is that with EXTERNAL, the report can assume the data is presented in the correct order, but without it, the report has to save up all the data (in a temporary table in the database) and then select the data from the table in the required sorted order, making it a two-pass report.
If you're brave and have the I4GL c-code compiler, you should take a look at the code generated for a report, but be aware it is some of the scariest code you're likely to encounter in a long time. It uses all sorts of tricks that you wouldn't dream of using yourself.
Workaround solutions — in outline
OK; so you can't do it directly. What are your options? In my view, you have two options:
Use two parameters specifically for choosing the ordering, and then use an ORDER BY (without EXTERNAL) clause that always lists them in a fixed order. However, when it comes time to use the report, choose which sequence you want the arguments in.
Write two reports that differ only in the report name and the ORDER EXTERNAL BY clause. Arrange to call the correct report depending on which order you want.
Of these, option 2 is by far the simpler — except for the maintenance issue. Most likely, you'd arrange to generate the code from a single copy. That is, you'd save REPORT print_label_code_desc in one file, and then arrange to edit that into REPORT print_label_desc_code (use sed, for example) — and the edit would reverse the order of the names in the ORDER BY clause. This isn't all that hard to do in a makefile, though it requires some care.
Option 1 in practice
What does option 1 look like in practice?
DECLARE c CURSOR FOR
    SELECT * FROM SomeTable

START REPORT print_label -- optional specification of destination, etc.
FOREACH c INTO rpt.*
    IF do_item_desc THEN
        OUTPUT TO REPORT print_label(rpt.item_code, rpt.description, rpt.*)
    ELSE
        OUTPUT TO REPORT print_label(rpt.description, rpt.item_code, rpt.*)
    END IF
END FOREACH
FINISH REPORT print_label
The report function itself might look like:
REPORT print_label(col1, col2, rpt)
    DEFINE col1 CHAR(40)
    DEFINE col2 CHAR(40)
    DEFINE rpt RECORD LIKE SomeTable.*

    ORDER BY col1, col2

    FORMAT
        FIRST PAGE HEADER
            …
        BEFORE GROUP OF col1
            …
        BEFORE GROUP OF col2
            …
        ON EVERY ROW
            …
        AFTER GROUP OF col1
            …
        AFTER GROUP OF col2
            …
        ON LAST ROW
            …
END REPORT
Apologies for any mistakes in the outline code; it is a while since I last wrote any I4GL code.
The key point is that the order-by values are passed specially to the report and are used solely for controlling its organization. You may need to be able to print different details in the BGO (shorthand for BEFORE GROUP OF; AGO for AFTER GROUP OF) sections for the two columns. That will typically be handled by (gasp) global variables — this is I4GL and they are the normal way of doing business. Actually, they should be module variables rather than global variables if the report driver code (the code which calls START REPORT, OUTPUT TO REPORT and FINISH REPORT) is in the same file as the report itself. You need this because, in general, the reporting at the group levels (in the BGO and AGO blocks) will need different titles or labels depending on whether you're sorting code before description or vice versa. Note that the meaning of the group aggregates changes depending on the order in the ORDER BY clause.
Note that not every report necessarily lends itself to such reordering. Simply running the BGO and AGO blocks in a different order is not always sufficient to make the report output look sensible. In that case, you will fall back onto option 2 — or option 2A, which is to write two separate reports that don't pretend to be just a reordering of the ORDER BY clause, because the formatting of the data needs to be different depending on the ORDER BY clause.
As you can see, this requires some care — quite a bit more care than the alternative (option 2). If you use dynamic SQL to create the SELECT statement, you can arrange to put the right ORDER BY clause into the string that is then prepared so that the cursor will fetch the data in the correct order — allowing you to use ORDER EXTERNAL BY after all.
Summary
If you're a newcomer to I4GL, go with option 2. If your team is not reasonably experienced in I4GL, go with option 2. I don't like it very much, but it is the approach that can be handled easily and is readily understood by yourself, your current colleagues, and those still to come.
If you're reasonably comfortable with I4GL and your team is reasonably experienced with I4GL — and the report layout really lends itself to being reorganized dynamically — then consider option 1. It is trickier, but I've done worse things in I4GL in times past.
You can have a CASE expression within the ORDER BY clause, one per sort column, as follows:
order by
    case when 1 = 1 then rpt.item_code else rpt.description end,
    case when 1 = 1 then rpt.description else rpt.item_code end
You can use prepare:
let query_txt = "select ... "
if is_reprint then
    let query_txt = query_txt clipped, " order by rpt.item_code, rpt.description"
else
    let query_txt = query_txt clipped, " order by rpt.description, rpt.item_code"
end if
prepare statement1 from query_txt
declare cursor_name cursor for statement1
And now start the report, use foreach, etc.
P.S. You must define query_txt as a CHAR long enough for the whole text.

Querying temporal data in Neo4j

There are several possible ways I can think of to store and then query temporal data in Neo4j. Looking at an example of being able to search for recurring events and any exceptions, I can see two possibilities:
One easy option would be to create a node for each occurrence of the event. Whilst it would be easy to construct a Cypher query to find all events on a day, in a range, etc., this could create a lot of unnecessary nodes. It would also make it very easy to change individual events' times, locations, etc., because there is already a node with the basic information.
The second option is to store the recurrence temporal pattern as a property of the event node. This would greatly reduce the number of nodes within the graph. When searching for events on a specific date or within a range, all nodes that meet the start/end date (plus any other) criteria could be returned to the client. It then boils down to iterating through the results to pluck out the subset whose temporal pattern gives a date within the search range, then comparing that to any exceptions and merging (or ignoring) the results as necessary (this could probably be partially achieved when pulling the initial result set as part of the query).
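For illustration, the second option might look roughly like the following with the neo4j Python driver; the Event label, the date properties and the recurrence check are hypothetical names, not something I have built yet.

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def events_in_range(range_start, range_end, matches_recurrence):
    # Pass 1: let Cypher narrow the candidates by the coarse start/end window.
    query = (
        "MATCH (e:Event) "
        "WHERE e.start_date <= $range_end AND e.end_date >= $range_start "
        "RETURN e")
    with driver.session() as session:
        candidates = [record["e"] for record in session.run(
            query, range_start=range_start, range_end=range_end)]
    # Pass 2: expand each node's recurrence pattern on the client and keep
    # only the occurrences that actually fall inside the requested range.
    return [e for e in candidates
            if matches_recurrence(e["recurrence"], range_start, range_end)]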
Whilst the second option is the one I would choose currently, it seems quite inefficient in that it processes the data twice, albeit a smaller subset the second time. Even a plugin to Neo4j would probably result in two passes through the data, but the processing would be done on the database server rather than the requesting client.
What I would like to know is whether it is possible to use Cypher or Neo4j to do this processing as part of the initial query?
Whilst I'm not 100% sure I understand your requirement, I'd have a look at this blog post; perhaps you'll find a bit of inspiration there: http://graphaware.com/neo4j/2014/08/20/graphaware-neo4j-timetree.html

Resources